Abstract

Spread regression is an extension of linear regression that allows for the inclusion of a predictor 
that contains information about the variance. It can be used to take the information from a weather 
forecast ensemble and produce a probabilistic prediction of future temperatures. There are a number 
of ways that spread regression can be formulated in detail. We perform an empirical comparison 
of four of the most obvious methods applied to the calibration of a year of ECMWF temperature 
forecasts for London Heathrow. 

1 Introduction 

There is considerable demand within industry for probabilistic forecasts of temperature, particularly from 
industries that routinely use probabilistic analysis such as insurance, finance and energy. However there 
is considerable disagreement among meteorologists about how such forecasts should be produced and at 
present no adequately calibrated probabilistic forecasts are available commercially. Those who need to 
use probabilistic forecasts have to make them themselves. 

How, then, should probabilistic forecasts of temperature be made? A number of very different methods
have been suggested in the literature, such as those described in Mylne et al. (2002), Roulston and Smith
(2003) and Raftery et al. (2003). However it seems that all three of these methods, although complex,
suffer from the shortcoming that they don't calibrate the amplitude of variations in the ensemble spread 
but rather leave the amplitude to be determined as a by-product of the calibration of the mean. 
We take a very different, and simpler, approach to the development of probabilistic forecasts than the 
authors cited above. Our approach is based on the following philosophy: 

• The baseline for comparison for all probabilistic temperature forecasts should be a distribution 
derived very simply by using linear regression around a single forecast or an ensemble mean. 

• More complex methods can then be tested against this baseline. Before anything more complex 
than linear regression is adopted on an operational basis it should be shown to clearly beat linear 
regression in out of sample tests. Unfortunately none of the studies cited above compared the 
methods they proposed with linear regression, and, given that they seem not to calibrate the 
ensemble spread correctly, it would seem possible that they might not perform as well. 
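
The baseline described above can be sketched in a few lines. This is a minimal illustration, assuming constant (non-seasonal) parameters and made-up anomaly values; the function names are hypothetical and are not taken from any of the cited studies.

```python
import math

def fit_linear_regression(m, t):
    """Least-squares fit of t = alpha + beta * m; returns (alpha, beta, sigma)."""
    n = len(m)
    m_bar = sum(m) / n
    t_bar = sum(t) / n
    sxx = sum((x - m_bar) ** 2 for x in m)
    sxy = sum((x - m_bar) * (y - t_bar) for x, y in zip(m, t))
    beta = sxy / sxx
    alpha = t_bar - beta * m_bar
    # Maximum-likelihood estimate of the residual standard deviation (divides by n).
    residuals = [y - (alpha + beta * x) for x, y in zip(m, t)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    return alpha, beta, sigma

def predictive_distribution(alpha, beta, sigma, m_new):
    """Mean and standard deviation of the normal forecast for one ensemble mean."""
    return alpha + beta * m_new, sigma

# ASSUMPTION: illustrative anomalies in degrees C, not data from this study.
ens_mean = [-2.0, -1.0, 0.0, 1.0, 2.0]
observed = [-1.5, -1.1, 0.2, 0.7, 1.7]

alpha, beta, sigma = fit_linear_regression(ens_mean, observed)
mu, sd = predictive_distribution(alpha, beta, sigma, 1.5)
print(f"alpha={alpha:.3f} beta={beta:.3f} sigma={sigma:.3f}")
print(f"forecast for ensemble mean 1.5: N({mu:.3f}, {sd:.3f})")
```

The forecast distribution has a mean that follows the ensemble mean and a standard deviation that is the same for every forecast, which is what makes it a natural baseline for models that let the uncertainty vary.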

We have followed this philosophy and, based on our analysis of one particular dataset of past forecasts 
and past observations we have shown that: 

• Moving from constant-parameter linear regression to seasonal parameter linear regression gives a
huge improvement in forecast skill for forecasts of both the mean temperature and the distribution
of temperatures (Jewson, 2004a).

• Adding spread as a predictor gives only a very small improvement (Jewson et al., 2003; Jewson,
2003b).

• Generalising to allow for non-normality gives no improvement at all (Jewson, 2003a).

All these results are summarised and discussed in Jewson (2004).

In this article we focus on the second of these conclusions: that using the spread as an extra predictor 
brings only a very small improvement to forecast skill. This is somewhat disappointing given that it 
had been hoped by some that use of the ensemble spread would turn out to be an important factor in 
the creation of probabilistic forecasts. We are trying to get a better understanding of why the ensemble
spread brings so little benefit in the tests we have performed. In Jewson (2004b) we concluded that this
is because of:

1. The scoring system we use. 

We calibrate and score probabilistic forecasts using the likelihood of classical statistics (Fisher,
1912; Jewson, 2003c). Likelihood, as we have used it, is a measure that considers the ability
of the forecast to predict the whole distribution of future temperatures. Much of the mass in the 
distribution of temperature is near the mean and so the likelihood naturally tends to emphasize the 
importance of the mean rather than the spread. If we were to use a score that puts more weight 
onto the tails of the distribution then the spread might prove more important (although such a 
score would not then reflect our main interest, which is in the prediction of the whole distribution). 

2. The low values of the coefficient of variation of spread (COVS). 

Once we have calibrated our ensemble forecast data we find that the uncertainty does not vary very 
much relative to the mean level of the uncertainty (i.e. the COVS is low). Thus if we approximate 
the uncertainty with a constant this does not degrade the forecast to any great extent, and we 
have not been able to detect a significant impact of the spread in out of sample testing. That the 
variations in the calibrated uncertainty are small could be either because the actual uncertainty 
does not vary very much or because the ensemble spread is not a good predictor for the actual 
uncertainty. In fact it is likely to be a combination of these two effects. 

3. The low values of the spread mean variability ratio (SMVR). 

We have also found that the amplitude of the variations in the uncertainty in the calibrated forecast 
is small relative to the amplitude of the variations in the mean temperature (i.e. the SMVR is low). 
As a result accurate prediction of the (small) variations in the uncertainty is not very important 
relative to accurate prediction of the (large) variations in the mean temperature. 
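
The two ratios in points 2 and 3 can be illustrated with a short calculation. This is a sketch only: it assumes that "amplitude of the variations" is measured by a standard deviation, so that COVS is the standard deviation of the calibrated uncertainty divided by its mean, and SMVR is the standard deviation of the calibrated uncertainty divided by the standard deviation of the calibrated mean. The function names and the numbers are made up for illustration.

```python
import statistics

def covs(sigmas):
    """Coefficient of variation of spread: variability of the calibrated
    uncertainty relative to its mean level (assumed definition)."""
    return statistics.pstdev(sigmas) / statistics.fmean(sigmas)

def smvr(sigmas, means):
    """Spread mean variability ratio: variability of the calibrated uncertainty
    relative to variability of the calibrated mean (assumed definition)."""
    return statistics.pstdev(sigmas) / statistics.pstdev(means)

# ASSUMPTION: illustrative values in degrees C, not results from this study.
calibrated_sigma = [1.9, 2.0, 2.1, 2.0, 2.0]
calibrated_mean = [-4.0, -1.0, 0.0, 2.0, 3.0]

c = covs(calibrated_sigma)
r = smvr(calibrated_sigma, calibrated_mean)
print(f"COVS = {c:.4f}, SMVR = {r:.4f}")
```

When both ratios are small, as in this toy example, replacing the varying uncertainty by a constant changes the forecast distribution very little.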

However in addition to these reasons it is also possible that we have been using the ensemble spread
wrongly in our predictions. The model we have been using represents the unknown uncertainty $\sigma$ as a
linear function of the ensemble spread $s$ (Jewson et al., 2003):

$$\sigma = \hat{\sigma} + \text{noise} \tag{1}$$

$$\phantom{\sigma} = \gamma + \delta s + \text{noise} \tag{2}$$

But this model is entirely ad-hoc. Why a linear function? We chose linear because it is the simplest way
to calibrate both the mean uncertainty and the amplitude of the variability of the uncertainty, and not
on the basis of any theory or analysis of the empirical spread-skill relationship. This suggests it is very
important to test other models to see if they perform any better.

In this paper we will compare the original spread-regression model with 3 other spread-regression models.
The four models we compare all have four parameters and so can be compared in-sample. This is
important because the signals we are looking for are weak and obtaining long stationary series of past
forecasts is more or less impossible at this point in time. At some point the numerical modellers will
hopefully start providing long (i.e. multiyear) back-test time series from their models. This will allow
more thorough out of sample testing of calibration schemes such as the spread-regression model and will
facilitate the comparison of models with different numbers of parameters: meanwhile we do what we can
with the limited data available.

2 Four spread regression models

The four spread-regression models that we will test are all based on linear regression between anomalies
of the temperature and anomalies of the ensemble mean:

$$T_i \sim N(\alpha + \beta m_i, \sigma_i) \tag{3}$$


The difference between the models is in the representation of $\sigma_i$.

The original standard-deviation-based spread regression model is:

$$\sigma_i = \gamma + \delta s_i \tag{4}$$

The variance-based model is:

$$\sigma_i^2 = \gamma^2 + \delta^2 s_i^2 \tag{5}$$

The inverse-standard-deviation-based model is:

$$\frac{1}{\sigma_i} = \gamma + \frac{\delta}{s_i} \tag{6}$$

and the inverse-variance-based model is:

$$\frac{1}{\sigma_i^2} = \gamma^2 + \frac{\delta^2}{s_i^2} \tag{7}$$

Following Jewson (2004a) the parameters $\alpha$, $\beta$, $\gamma$, $\delta$ all vary seasonally using a single sinusoid. We fit each
model by finding the parameters that maximise the likelihood (using numerical methods).
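
The four uncertainty models and the quantity being optimised can be sketched as follows. This is a minimal illustration rather than the code used for the results: it assumes constant (non-seasonal) parameters, it assumes the squared-parameter forms for the variance-based and inverse-variance-based models, and all function names and data values are hypothetical.

```python
import math

def sigma_sd(gamma, delta, s):
    """Standard-deviation-based model: sigma = gamma + delta * s."""
    return gamma + delta * s

def sigma_var(gamma, delta, s):
    """Variance-based model (assumed form): sigma^2 = gamma^2 + delta^2 * s^2."""
    return math.sqrt(gamma ** 2 + delta ** 2 * s ** 2)

def sigma_inv_sd(gamma, delta, s):
    """Inverse-standard-deviation-based model: 1/sigma = gamma + delta / s."""
    return 1.0 / (gamma + delta / s)

def sigma_inv_var(gamma, delta, s):
    """Inverse-variance-based model (assumed form): 1/sigma^2 = gamma^2 + delta^2 / s^2."""
    return 1.0 / math.sqrt(gamma ** 2 + delta ** 2 / s ** 2)

MODELS = {
    "sd": sigma_sd,
    "variance": sigma_var,
    "inverse sd": sigma_inv_sd,
    "inverse variance": sigma_inv_var,
}

def negative_log_likelihood(sigma_model, alpha, beta, gamma, delta, m, s, t):
    """Negative log-likelihood of observations t under N(alpha + beta*m_i, sigma_i)."""
    nll = 0.0
    for m_i, s_i, t_i in zip(m, s, t):
        mu_i = alpha + beta * m_i
        sig_i = sigma_model(gamma, delta, s_i)
        if sig_i <= 0.0:
            return math.inf  # parameters that give a non-positive sigma are rejected
        z = (t_i - mu_i) / sig_i
        nll += 0.5 * math.log(2.0 * math.pi) + math.log(sig_i) + 0.5 * z * z
    return nll

# ASSUMPTION: illustrative anomalies and spreads, not data from this study.
ens_mean = [-2.0, -1.0, 0.0, 1.0, 2.0]
ens_spread = [0.8, 1.0, 1.2, 0.9, 1.1]
observed = [-1.5, -1.1, 0.2, 0.7, 1.7]

for name, model in MODELS.items():
    score = negative_log_likelihood(model, 0.0, 0.8, 1.0, 0.5,
                                    ens_mean, ens_spread, observed)
    print(f"{name}: negative log-likelihood = {score:.3f}")
```

In practice a numerical optimiser would minimise this function over the parameters separately for each model, so the same parameter values would not be shared across models as they are in this loop.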
We note that for very small variations in s all these models can be linearised and end up the same as the
linear-in-standard-deviation model given in equation (4).
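
The linearisation can be made explicit with a first-order Taylor expansion. The sketch below assumes that each model can be written as $\sigma_i = f(s_i)$ with $f$ smooth near the mean spread $\bar{s}$, and that the variance-based and inverse-variance-based models use squared parameters; the symbols $\bar{s}$, $\gamma^*$ and $\delta^*$ are introduced here for illustration only.

```latex
% First-order expansion of sigma_i = f(s_i) about the mean spread \bar{s}
\sigma_i = f(s_i) \approx f(\bar{s}) + f'(\bar{s})\,(s_i - \bar{s})
         = \underbrace{f(\bar{s}) - f'(\bar{s})\,\bar{s}}_{\gamma^*}
         + \underbrace{f'(\bar{s})}_{\delta^*}\, s_i

% Variance-based model
f(s) = \sqrt{\gamma^2 + \delta^2 s^2}, \qquad
f'(\bar{s}) = \frac{\delta^2 \bar{s}}{f(\bar{s})}

% Inverse-standard-deviation-based model
f(s) = \frac{s}{\gamma s + \delta}, \qquad
f'(\bar{s}) = \frac{\delta}{(\gamma \bar{s} + \delta)^2}

% Inverse-variance-based model
f(s) = \left(\gamma^2 + \frac{\delta^2}{s^2}\right)^{-1/2}, \qquad
f'(\bar{s}) = \frac{\delta^2 f(\bar{s})^3}{\bar{s}^3}
```

In each case the expansion has the form of an offset plus a multiple of the spread, which is the linear-in-standard-deviation model with effective parameters $\gamma^*$ and $\delta^*$. The approximation holds only while $s_i - \bar{s}$ is small compared with $\bar{s}$.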



3 Results 

The first and most important test is to see which of the models achieves the greatest log-likelihood at the
maximum. The results from this test are shown in figure 1 (actually in terms of negative log-likelihood
so that smaller is better). In each case the spread-regression results (dashed lines) are shown relative to
results for a constant-variance model (solid line). What we see is that the four models achieve roughly
the same decrease in the negative log-likelihood and that in none of the cases is the decrease very large
compared with the change in the log-likelihood from one lead time to the next. These changes are also
small compared with the change in the log-likelihood that was achieved by making the bias correction
vary seasonally (Jewson, 2004a).

Figure 2 shows the same data as is shown in figure 1 but as differences between the spread-regression
models and the constant-variance model. Again we see that there is little to choose between the models.
Figure 3 shows a fifty-day sample of the calibrated mean temperature from the constant-variance model
with the spread-regression calibrated temperatures overlaid. The differences are very small indeed and
can only really be seen when they are plotted explicitly in figure 4.

Figure 5 shows the calibrated spread from the constant-variance model and the calibrated spread from
the four spread-regression models. The uncertainty prediction from the constant variance model varies 
slowly from one season to the next and has a kink because of the presence of missing values in the forecast 
data. We now see rather significant differences between the four spread regression models. The size of 
these differences suggests that the variations in s are not so small that the four spread regression models 
are equivalent to the linear-in-standard-deviation model. 



4 Conclusions 

How to produce good probabilistic temperature forecasts from ensemble forecasts remains a contentious 
issue. This is mainly because of disagreement about how to use the information in the ensemble spread. 
We have compared 4 simple parametric models that convert the spread into an estimate for the forecast 
uncertainty. All the models allow for an offset and a term that scales the amplitude of the variability 
of the uncertainty. Although the four models lead to visible differences in the calibrated spread we have 
found only tiny differences between the impact of these four models on the log-likelihood achieved. Also 
none of the models clearly dominates the others. 
These results lead us to conclude that: 

• the variations in s are not so small that the calibration of the spread can be linearised, which would 
make all four models equivalent 

• but the changes in the calibrated uncertainty are small enough that they do not have a great impact 
on the maximum likelihood achieved in any of the models 

• implying that there is simply not very much information in the variations in the spread 



It is possible that the models are overfitted to a certain extent. This is unavoidable given that we only 
have one year of data for fitting these multiparameter models. That none of the models dominates is 
rather curious: perhaps all the models are equally bad and none of them come close to modelling the 
relationship between spread and skill in a reasonable way. This raises the possibility that better results 
could perhaps be achieved by using other parametrisations. 

It is difficult to see how to make further progress on these questions until longer series of stationary
back-test data are made available by the numerical modellers. Meanwhile it seems that a pragmatic approach to
producing probabilistic forecasts would be to stick with the constant variance model since more complex 
models have shown only a small benefit in in-sample testing, and do not show a significant benefit in 
out-of-sample testing. 
Figure 1: The negative log-likelihood scores achieved by a linear regression (solid line) and four
spread-regression models (dotted lines).

Figure 3: The calibrated mean temperature from linear regression (solid line) and four spread-regression
models (dotted lines). The dotted lines cannot be distinguished because they are so close to the solid
lines.